University of Konstanz – UKN

VAST 2009 Challenge
Grand Challenge

Authors and Affiliations:

Peter Bak, University of Konstanz, bak@dbvis.inf.uni-konstanz.de   

Christian Rohrdantz, University of Konstanz, rohrdantz@dbvis.inf.uni-konstanz.de

Svenja Leifert, University of Konstanz, svenja.leifert@uni-konstanz.de

Christoph Granacher, University of Konstanz, christoph.granacher@uni-konstanz.de

Stefan Koch, University of Konstanz, stefanmoritzkoch@googlemail.com

Simon Butscher, University of Konstanz, simon.butscher@uni-konstanz.de

Patrick Jungk, University of Konstanz, patrick.jungk@uni-konstanz.de

Tool(s):

VAT – Video Analysis Tool, developed at the University of Konstanz, 2009

KNIME – Data Analysis and Visualization Tool, University of Konstanz: www.knime.org

Pajek, Network analysis program: http://pajek.imfm.si/doku.php

 

Video:

 

Traffic Video: traffic-video.wmv

Flitter Video: Flitter-Video.wmv

Video Analysis Video: Video-Video.wmv

 

 

ANSWERS:


GC.1: Please describe the scenario supported by your analysis of the three mini-challenges in a Debrief.

An employee of the US embassy in Flovania is leaking information to a criminal organization. The employee with staff ID 30 is considered suspicious for several reasons. We have evidence that he/she sent large amounts of data from different embassy computers belonging to fellow colleagues to the external IP address 100.59.151.133. He/she apparently took advantage of the absence of colleagues and used their computers for his/her criminal activity. In some cases the absence of these colleagues from their workplaces during the transactions to the mentioned IP address is clearly documented: the entry logs show that they either had not yet arrived at the office in the morning or were verifiably in the classified area. In the remaining cases we could infer their absence from their network traffic behavior. Gaps in their otherwise continuous network traffic suggest that they were taking a break and may have left their office; and even though employees are not required to log out of the building, it is apparent from their network traffic when they stopped working in the evenings. For every transaction to the IP 100.59.151.133, at least one of the above indications of the computer owner's absence was given, and in almost every case the respective roommate was apparently absent as well.

This does not hold for two of the earliest suspicious transactions, on the 8th and 15th of January: the employee with ID 30 was in the office and active while the computer of his/her roommate, 31, was abused for criminal activities. This is only one of several indications of the guilt of employee 30. While studying the network traffic behavior of the absent colleagues, we found that both the amount of traffic produced and the destination IP were unusual for them. The amount of data sent is especially salient: among the 16 network traffic events with the highest request sizes during the whole month of January, 13 were connections to the suspicious IP. This makes us confident that the identified IP address was the one used for leaking information. We also assessed the 'alibi' of all employees during the suspicious criminal activity and found only two employees without an alibi for any of the times when data was sent to the mentioned IP address: employee 30 and employee 27. The second suspect, with staff ID 27, was finally exonerated, because the first four criminal transmissions were conducted from the computer of employee 30's roommate while employee 30 was active in that office.

It is also evident that the perpetrator repeatedly used the office neighbor's computer for the criminal activities in the beginning and then started to use other employees' computers in order to cover his/her tracks. There is more systematic behavior: he/she always sent the large amounts of data to the same external IP address, always on Tuesdays and Thursdays. He/she started with a single transaction on January 8th, continued sending twice a day (on the 10th, 15th, 17th and 22nd) and finally three times a day (on the 29th and 31st of January). Seemingly he/she was either forced to raise the amount of material transmitted or felt more secure over time. In addition, the network traffic of employee 30 always went up significantly immediately (1 to 2 minutes) after the suspicious activity. This high traffic goes to unsuspicious IP addresses, which makes us believe that it is a diversionary tactic.

We were also able to uncover the communication channel this suspicious employee (ID 30) has to the criminal organization. The social networking/micro-blogging tool Flitter was used for the communication with three handlers. In Flitter the employee has the ID 100 and the handlers have the IDs 194, 261 and 563. The handlers in turn communicate with a person code-named "Boris", with Flitter ID 4994. Boris has direct contact with the fearless leader (Flitter ID 4) of the criminal organization. We are quite certain about this configuration, since all other possible scenarios could be discarded. The employee and the three handlers live in Prounov, the second largest city of Flovania, which lies close to the capital city Koul. Boris lives in Kannvic, in the east, and the fearless leader in Kouvnic, in the north of the country. The international contacts of the leader are distributed over all surrounding countries (Tulamuk in Trium, Otello in Posana and Transpasko in Transak). The fact that the employee and the handlers live in the same city makes us believe that they probably also met in person. On these occasions money, information or even objects may have been exchanged. Since the number of transmissions rose to three per occasion from the 24th of January on, it is reasonable to believe that the compensation rose as well. This hypothesis is backed up by suspicious events that we detected during the analysis of the surveillance videos.

The analysis of the surveillance video from public places near the embassy revealed a number of suspicious events. We declared events as suspicious when two persons met or a person approached a vehicle. Among the large number of events, we looked closely at those that fall into the time frame of the activities uncovered by the network traffic analysis. We gathered video data from the 24th and 26th of January. On the morning of the 24th there were several suspicious events of two persons meeting. Some of these events coincide with time ranges in which employee 30 was inactive. In particular, employee 30 arrived at the office at 7:47am and started working at 8:05am. During this gap, a suspicious meeting took place at location 2 in the surveillance video, starting at 8:00am and lasting one minute. The next gap in his/her network traffic occurred between 8:06am and 9:09am, and he/she logged out of the classified area at 9:00am without having logged in! During this time period, four suspicious events were recorded by the surveillance video: at 8:14am for 1:47 minutes and at 8:37am for 1:11 minutes at location 2, at 8:41am for 1 minute at location 4, and at 8:43am for 1 minute at location 3. Each of these events shows a meeting of two persons. Later, between 12:09 and 12:33, he/she was logged into the classified area. During this time period two suspicious video events occurred, at 12:21 for 23 seconds and at 12:31 for 35 seconds. As the employee had already proved unreliable in logging in and out of the classified area, we believe that he/she may only have pretended to be in the classified area. For the remaining suspicious events in the video it is also possible that employee 30 left his/her workplace for a short time. In any case, he/she was not active during these events, but the question remains how he/she managed to enter the embassy without logging in and how he/she could manipulate the logging system of the classified area. In the former case, he/she might have piggybacked, as other employees entered the building regularly during the morning.

Although the 26th of January was a Saturday, which is not a working day for the embassy employees, we detected several suspicious events on that day as well. The suspicious employee could therefore easily have met a handler at a public location.


GC.2:  Who are the major players in the scenario and what are their relationships?

 

We conducted our analysis on three parallel tracks. The first track investigated the suspicious computer use at the embassy. The second track investigated the Flitter data, which provides information about the social communication network of the criminal organization. The third track investigated the surveillance video data recorded in the surroundings of the embassy. These tracks are described in detail in the following subsections.

 

Network Traffic Analysis

 

The most characteristic trait of suspicious computer use we found is that the guilty employee used the PCs of workmates who were absent from their offices.

Our process is a form of the KDD pipeline (see Figure 1) with three main iterative phases: the data needed to be prepared (Data Preparation) and analyzed with programs or visual analytics tools (Interaction); then we could draw conclusions and gather new information (Knowledge).

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/traffic/traffic_8/UKN-KNIME-MC1/UKN-KNIME-MC1/Picture1.png
Figure 1: Network Traffic Analysis - Pipeline

 

 

In the Data Preparation phase we computed minutes per day and minutes per month. The proxLog and IPLog data tables were combined into an Overview table containing the employee IDs, types and time components, where "type" corresponds to the "Type" column in the proxLog dataset and the "Socket" column in the IPLog.
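A minimal sketch of this joining step in Python/pandas, assuming simplified column names and hypothetical sample rows (the actual preparation was done in KNIME and involved many more small steps):

```python
import pandas as pd

# Hypothetical miniatures of the two logs; the real column names differ slightly.
prox = pd.DataFrame({
    "ID":   [30, 30],
    "Type": ["prox-in-building", "prox-in-classified"],
    "Time": ["2008-01-08 07:47:00", "2008-01-08 12:09:00"],
})
ip = pd.DataFrame({
    "ID":     [30],
    "Socket": ["100.59.151.133:8080"],
    "Time":   ["2008-01-08 08:05:00"],
})

# Unify the logs: "type" holds "Type" from proxLog and "Socket" from IPLog.
overview = pd.concat(
    [prox.rename(columns={"Type": "type"}),
     ip.rename(columns={"Socket": "type"})],
    ignore_index=True,
)

# Derive the two time components used for plotting.
t = pd.to_datetime(overview["Time"])
overview["minute_of_day"] = t.dt.hour * 60 + t.dt.minute
overview["minute_of_month"] = (t.dt.day - 1) * 24 * 60 + overview["minute_of_day"]
print(overview)
```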

We needed about one hour to prepare the data as we were forced to undertake many small steps like splitting strings, changing data types etc.

In later iterations, this phase only involved different filterings or selections of the data (e.g. looking at IDs separately).

Everything except the joining of the proxLog and IPLog data tables was done semi-automatically: the preprocessing steps needed to be identified manually and were then carried out by the system on the data tables.

We could then begin to search for anomalies.

Plotting minutes per month against minutes per day gives a good overview of each person's data traffic (see Figure 2). Colors are mapped to request sizes, and red squares symbolize the largest amounts. This gave us a few suspicious IDs but no definite results. However, we encountered the same IDs in different situations later.

Another approach was plotting the Overview table for each ID in the same minutes-per-month/minutes-per-day view, with colors mapped to the event type (blue = data traffic, green = prox-in-building, red = prox-in-classified, yellow = prox-out-classified; see Figure 3). In several cases, a blue square appears between a red and a yellow one, which means the employee's PC was used while he/she was in the classified area.
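A minimal matplotlib sketch of this per-person view; the four sample events are hypothetical, and the real plots were produced with our visual analytics tools:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical excerpt of the Overview table for a single employee.
person = pd.DataFrame({
    "minute_of_month": [10540, 10558, 10800, 10830],
    "minute_of_day":   [460, 478, 720, 750],
    "type": ["prox-in-building", "data traffic",
             "prox-in-classified", "prox-out-classified"],
})
colors = {"data traffic": "blue", "prox-in-building": "green",
          "prox-in-classified": "red", "prox-out-classified": "yellow"}

plt.scatter(person["minute_of_month"], person["minute_of_day"],
            c=[colors[t] for t in person["type"]], marker="s")
plt.xlabel("minute of month")
plt.ylabel("minute of day")
plt.title("One employee's prox and network events")
plt.show()
```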

To make sure we had found all suspicious moments in our manual inspection, we wrote program 1 to corroborate our findings (see first part, first four rows, Figure 4). It discovered two IDs that logged into the classified area without logging out later (ID 38 on the 4th at 13:12, ID 49 on the 8th at 12:56). We then wrote program 2, which detected three further exceptions: ID 30 logged out without having logged in before (on the 10th at 10:33, the 17th at 11:31 and the 24th at 9:00).
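A minimal sketch of the logic behind the two programs, over a simplified event list (the sample timestamps correspond to the findings above; the real log format differs):

```python
from collections import defaultdict

# (employee ID, timestamp, event type) — simplified sample events.
events = [
    (38, "2008-01-04 13:12", "prox-in-classified"),
    (49, "2008-01-08 12:56", "prox-in-classified"),
    (30, "2008-01-10 10:33", "prox-out-classified"),
]

by_employee = defaultdict(list)
for emp, ts, kind in sorted(events):
    by_employee[emp].append((ts, kind))

for emp, log in sorted(by_employee.items()):
    inside = None  # timestamp of an unmatched classified-area entry
    for ts, kind in log:
        if kind == "prox-in-classified":
            inside = ts
        elif kind == "prox-out-classified":
            if inside is None:
                print(f"ID {emp}: logged out at {ts} without logging in")
            inside = None
    if inside is not None:
        print(f"ID {emp}: logged in at {inside} without logging out")
```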

In this part of the process, everything but the detection of anomalies in the plots was achieved automatically, which took about two and a half hours.

 

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/traffic/traffic_8/UKN-KNIME-MC1/UKN-KNIME-MC1/Picture2.png
Figure 2: Scatter plot overview of employees' data traffic.

 

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/traffic/traffic_8/UKN-KNIME-MC1/UKN-KNIME-MC1/Picture3.png
Figure 3: Scatter plot overview of a person's behavior: blue = data traffic, green = prox-in-building, red = prox-in-classified, yellow = prox-out-classified.

 

 

All suspicious data traffic we found had the destination IP 100.59.151.133, so that probably all traffic to this address is suspicious (see first three columns, Figure 4).

We took a closer look at these occasions.

Looking at the owners of suspicious PCs, their office neighbors and their (probable) behavior during the times suspicious data traffic occurred (manually, using the minutes-per-month/minutes-per-day plots), we found that the employees' absence could be explained in most cases (see Figure 4). As the traitor does not want to be detected, he/she would not have used an office while anyone was present. However, on two occasions ID 30 was present and active while his/her neighbor's PC was used.

We know that being in the classified area is quite a good alibi, so we counted (manually, for all IDs) the cases in which an employee had been in the classified area while suspicious data traffic occurred (see upper half, Figure 5). Only IDs 27 and 30 never have an alibi, which makes them highly suspicious. Knowing that it is not impossible to sneak into or out of the classified area, however, this did not give us definite results.

ID 30's behavior on the 8th and 15th led us to count (again manually) in how many cases each employee had been active in a one- and two-minute interval around the suspicious data traffic (see lower half, Figure 5; a sketch of this count follows). We could see that ID 30 was extremely active in the two-minute intervals (nearly twice as active as any other employee) and concluded that he/she had tried to fake his/her presence in his/her own office by generating data traffic shortly after leaking confidential information. But was it possible for ID 30 to know when which office was empty? A look at the office plan reveals that office 15 (IDs 30 and 31) offers a good view over most of the affected offices and the corridor to the classified area.
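The interval count can be sketched as follows; timestamps are simplified to minutes of the month and all sample values are hypothetical:

```python
# Minutes (of the month) at which data went to 100.59.151.133 (hypothetical).
suspicious = [10565, 20160]
# Each employee's own network events, on the same time scale (hypothetical).
traffic = {30: [10566, 20162, 25000], 27: [30000]}

for window in (1, 2):
    counts = {emp: sum(any(abs(t - s) <= window for s in suspicious)
                       for t in events)
              for emp, events in traffic.items()}
    print(f"activity within ±{window} min of a suspicious event:", counts)
```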

This took us about 2 hours.

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/traffic/traffic_8/UKN-KNIME-MC1/UKN-KNIME-MC1/Picture4.png
Figure 4: Listing of suspicious behavior.

 

 

We can now detect clear patterns in 30's behavior.

ID 30 has short gaps in his/her own data traffic shortly before each time suspicious data traffic occurs, but is often active right after the malicious transfer is done. We suspect he/she prepared some kind of data traffic on his/her own PC beforehand.

Apart from that, he/she began slowly with one transmission per day, then two, later three. Three of the first five transmissions were even carried out from his/her own office, but he/she became more careful later and used different offices.

If one divides the suspicious data traffic as in Figure 5, one group for PCs that were used while their owners were in the classified area and one for the rest, a clear pattern is visible: while the events of the first group are spread over the whole day, the others mainly take place in the morning and evening, when many employees are not yet there or have already gone.

Furthermore, all data was sent on Tuesdays and Thursdays.

Here, short looks at different plots (manually, all in all less than half an hour) were enough to detect the anomalies, while a deeper investigation of ID 30's data traffic compared to the rest did not give any results beyond those already mentioned: transmitting to 100.59.151.133.

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/traffic/traffic_8/UKN-KNIME-MC1/UKN-KNIME-MC1/Picture5.png

Figure 5: Final listing of suspicious behavior, including activity 1-2 minutes before and after each criminal act.

 

Social Network and Geographic Analysis (Flitter Data)

 

The Flitter data was analyzed using a visual analytics approach as described below.

 

 


Figure 6: The pipeline used for our analysis: first a data selection and aggregation is made, followed by an iterative visualization approach.

 

Selection and Preprocessing

We started our analysis by getting familiar with the data and writing down the constraints for each scenario, breaking them up into parts that we judged necessary, possible or merely speculative. The data was inserted into a MySQL database using Navicat Lite. We then built an aggregated table with all the given connection information, e.g. the exact geo-location on the map and the connection count of each user, using a small PHP script that took about an hour to write (a sketch of the aggregation follows). The connection data itself was loaded into Pajek using the txt2pajek helper tool.
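A sketch of the connection-count aggregation, using Python's built-in sqlite3 in place of the original MySQL/PHP setup; the table and column names are assumptions, not the original schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE links (source INTEGER, target INTEGER);
    INSERT INTO links VALUES (100, 194), (100, 261), (100, 563), (194, 4994);
""")

# Connection count per user; links are undirected, so count both endpoints.
rows = con.execute("""
    SELECT id, COUNT(*) AS degree FROM (
        SELECT source AS id FROM links
        UNION ALL
        SELECT target AS id FROM links
    ) GROUP BY id ORDER BY degree DESC
""").fetchall()
print(rows)  # [(100, 3), (194, 2), ...]
```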

We initially visualized the complete graph using Pajek's force-directed layout algorithms and started to reduce the network into a Pajek partition in which vertices are colored according to their connection count. The result was still a very cluttered view, so we decided to apply more constraints to get rid of useless information.

To do that, we first defined four classes (employee, handler, middleman and fearless leader) and assigned persons to the classes according to their connection counts. Based on these classes we added further constraints, first with SQL statements; later we developed a lightweight Java tool to structure the process of adding constraints.
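A sketch of the degree-based class assignment; the middleman (4-5 contacts) and fearless-leader (over 100 contacts) cut-offs follow the scenario constraints described below, while the handler range is a placeholder, not the exact threshold used in our tool:

```python
HANDLER_RANGE = (30, 40)  # assumed placeholder, not the tool's exact cut-off

def classify(degree):
    if degree > 100:
        return "fearless leader"
    if 4 <= degree <= 5:
        return "middleman"
    if HANDLER_RANGE[0] <= degree <= HANDLER_RANGE[1]:
        return "handler"
    return "other"  # employees are pinned down by structure, not degree alone

# Hypothetical connection counts for four Flitter users.
degrees = {4: 150, 194: 32, 4994: 5, 100: 12}
print({uid: classify(d) for uid, d in degrees.items()})
```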

 

Visual Analytics Approach

At first the analysis was led by the idea of concentrating on the scenario with more available information and easier constraints, which is clearly scenario A. It appeared that scenario B was not directly supported by the data, considering the fixed values for the connection counts of the middlemen (which would be 2-3 contacts). The only possibility for scenario B was that the middlemen have contact with more than one of the handlers.

We concentrated on scenario A first and used the given constraints to reduce the dataset. The critical point was to check which user of the class employee had connections to at least 3 persons of the class handler, and whether all of these handlers had contact with someone with 4-5 contacts. This middleman, with the code name Boris, also had to have contact with the fearless leader, who has a connection count of over 100.

We wrote our Java tool in an iterative process, which took us about 6 hours. In each step of the process we added a new constraint and then visualized the results with the help of Pajek. Some constraints, e.g. that the handlers are not allowed to communicate among themselves, were not included, because this could easily be seen in the visualization. As a result we got exactly one network that matched the given constraints of scenario A (a sketch of the structural check follows).
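The structural check for scenario A can be sketched as follows; the class sets and edges are a hypothetical miniature of the reduced network:

```python
from itertools import combinations

edges = {(100, 194), (100, 261), (100, 563),
         (194, 4994), (261, 4994), (563, 4994), (4994, 4)}
employees, handlers = {100}, {194, 261, 563}
middlemen, leaders = {4994}, {4}

def linked(a, b):
    return (a, b) in edges or (b, a) in edges

for emp in employees:
    hs = [h for h in handlers if linked(emp, h)]
    if len(hs) < 3:
        continue  # the employee must contact at least 3 handlers
    for trio in combinations(hs, 3):
        for m in middlemen:
            if (all(linked(h, m) for h in trio)
                    and any(linked(m, ldr) for ldr in leaders)):
                print(f"scenario A match: employee {emp}, "
                      f"handlers {trio}, middleman {m}")
```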

The next step was to add tool support for scenario B. We checked again which user of the class employee had connections to at least 3 persons of the class handler. But this time it was possible that each handler has his own middleman with 2-4 contacts. These middlemen had to have contact with one potential leader. In the end we saw no evidence in the data that scenario B would match.

By mapping the network structure onto the map of Flovania, we realized that the fearless leader did not live in one of the larger cities. But because this geospatial implication was mentioned in the task description, we decided to validate the result again.

To do so, we used SQL statements and visualizations. We started again by looking for employees with connections to at least 3 handlers, which left only 13 potential employees. Then we queried for the connections to potential handlers, middlemen and leaders and visualized the result set for each potential employee separately.

Figure 7 shows the visualization for the employee with ID 19. This network structure nearly matches the constraints of scenario B: four handlers are connected to one employee. However, only two of the handlers have middlemen with contact to one leader. In our analysis we found no fully matching structure for scenario B at all.

19.bmp

Figure 7: Network structure of the employee with ID 19.

 

Visualizing the network for the employee with ID 100 (Figure 8) makes it easy to see that it fits the network structure of scenario A: one employee is connected to 3 handlers, and they are connected to one middleman, who is related to the leader. This is the only matching structure we found in the data. This mostly manual analysis of the 13 employees took us about 2 hours.

 

100.bmp

Figure 8: Network structure of the employee with ID 100.

 

Result

To visualize our final result, we took the detected employee, the three handlers, the middleman and the fearless leader and queried for all connections between these persons. We also added all international contacts of the fearless leader and the contact of the middleman Boris to a not yet mentioned member of the organization. Figure 9 shows our final network, which seems to be the best match for the task.

result2.png

Figure 9: The complete resulting network of the criminal organization.

 

We believe that the person with ID 100 is the employee and that the persons with IDs 194, 261 and 563 are his handlers. As the three handlers have contact with only one person from the group of persons with 4 or 5 contacts, this person has to be the middleman Boris, who has the ID 4994. Boris in turn has only one contact in the group of persons with over 100 contacts: the person with ID 4, who therefore seems to be the fearless leader. All of these IDs were obtained with the help of our own tool. Furthermore, we found one more person linked with Boris, so it is very probable that the person with ID 1612 is also a member of the organization.

 

Video Analysis

 

1. Assumptions

To identify any events of potential counter-intelligence/espionage interest, a definition of such a suspicious event needs to be given. The following events were defined as suspicious:

These events need to be described formally as behavioral patterns. To recognize such an event, the following dimensions have to be considered as well:

Suspicious items are as follows:

Suspicious areas are as follows:

These specify the areas of interest. In order to determine events, items within an area have to be recognized. Therefore, areas of movement within the video need to be detected, since every moving object may indicate suspicious activity. These areas of movement are marked and classified, see Figure 10.

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/video/video_23/UKN-AVA-VAT-MC3/pics/big/classifying.jpg
Figure 10: Classification needs interactive user involvement



The following types are considered potentially suspicious and need to be determined:

All other moving areas are not considered suspicious; they are irrelevant and can be excluded.

 

2. Analysis

 

To analyze the video data, an interactive process based on the KDD (Knowledge Discovery in Databases) - pipeline was used, as shown in figure 11.

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/video/video_23/UKN-AVA-VAT-MC3/pics/big/infoViss.jpg
Figure 11: KDD pipeline (Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.: From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining (1996), 1-34; http://www.aaai.org/aitopics/assets/PDF/AIMag17-03-2-article.pdf)



Following this terminology, a flow chart (Figure 12) was created describing the operative steps required to conduct a successful analysis of video stream data.

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/video/video_23/UKN-AVA-VAT-MC3/pics/big/Timeline.jpg
Figure 12: Analysis process of video data

 

In order to extract the data from the video, thresholds have to be set manually by the user. The most important thresholds are as follows:

 

2.1 Determination of bounding boxes

As the result of this determination chain, the bounding boxes inside a frame are obtained automatically. The color bar below the tool's frame preview shows the count of bounding boxes over time; each line stands for one location, starting with the first one. The lighter the color, the more bounding boxes were found in that time slot. Each movement of the camera position indicates a change of location, which yields the location information relevant for the result. A sketch of a simple bounding-box extraction is given below.
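A minimal sketch of such a frame-differencing bounding-box extraction with OpenCV 4; the threshold values are placeholders, and the VAT tool's actual determination chain differs:

```python
import cv2

cap = cv2.VideoCapture("traffic-video.wmv")  # any local video file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

boxes_per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixels that changed since the last frame indicate movement.
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes_per_frame.append([cv2.boundingRect(c) for c in contours
                            if cv2.contourArea(c) > 50])  # drop tiny noise blobs
    prev_gray = gray
cap.release()

print(f"{sum(map(len, boxes_per_frame))} bounding boxes "
      f"in {len(boxes_per_frame)} frames")
```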

 

2.2 Classification of Bounding Boxes

The process of classification is an interactive process divided into two sub-processes:

  1. manual classification of a subset of bounding boxes (training)
  2. automatic classification of the remaining bounding boxes using a neural network (Multi-Layer Perceptron Predictor) or a Decision Tree Predictor

 

No. | Name       | Color  | R   | G   | B
----|------------|--------|-----|-----|-----
1   | human      | green  | 77  | 157 | 74
2   | two humans | orange | 255 | 127 | 0
3   | car        | red    | 228 | 26  | 28
4   | two cars   | blue   | 126 | 126 | 184

Table 1: Colors mapped to the classified bounding boxes for visualization.

 

The training data is then used in the next step (Multi-Layer Perceptron Predictor or Decision Tree Predictor), as Figure 13 shows. At the end of this step, a table containing all classified bounding boxes of one sub-video is produced. A sketch of this two-stage classification follows the figure.

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/video/video_23/UKN-AVA-VAT-MC3/pics/big/training_to_mlp.jpg
Figure 13: Classification needs user interaction and computing using prediction algorithms
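A sketch of the two-stage classification with scikit-learn standing in for the KNIME predictor nodes named above; the features (box width, height, aspect ratio) and labels are hypothetical:

```python
from sklearn.neural_network import MLPClassifier

# Stage 1: manually labeled subset of bounding boxes (training data).
X_train = [[12, 30, 0.40], [14, 32, 0.44], [40, 20, 2.00], [44, 22, 2.00]]
y_train = ["human", "human", "car", "car"]

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# Stage 2: automatic classification of the remaining bounding boxes.
print(clf.predict([[13, 31, 0.42], [42, 21, 2.00]]))
```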



 

3. Determination of Suspicious Events

Once the patterns have been recognized, the suspicious events can be reviewed by visualizing the matched patterns, as shown in Figure 14. A sketch of one such pattern check follows the figure.

 

http://vastsubmission.cs.uml.edu/ChallengeSubmissions/2009/UKN-AVA/video/video_23/UKN-AVA-VAT-MC3/pics/big/review_01.jpg
Figure 14: Patterns can be verified manually and marked for export to a result table.
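One such pattern check, "two persons together for at least a minute", can be sketched as follows; the frame rate and the per-frame class sequence are assumptions:

```python
FPS = 10  # assumed frame-extraction rate
# Hypothetical class of the box at one location, frame by frame.
frames = ["human"] * 50 + ["two humans"] * 700 + ["human"] * 50

events, start = [], None
for i, label in enumerate(frames + ["end"]):  # sentinel closes an open run
    if label == "two humans" and start is None:
        start = i
    elif label != "two humans" and start is not None:
        if (i - start) / FPS >= 60:  # the run persisted for a minute or more
            events.append((start / FPS, (i - start) / FPS))
        start = None
print(events)  # [(start_second, duration_seconds)]
```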



4. Result

4.1 Suspicious Events

As a result, the most relevant pattern is:

This also means that two persons may walk down a street together (which implies a previous meeting).

The pattern:

needs to be redefined for another run, since too many events were found.

 

4.2 Performance Comparison of Automatic and Interactive Parts

Performance was assessed for the interactive and automatic parts of the process chain. Processing times for the user as well as for the hardware (server, PC) are listed separately in Table 2.

 

No. | Process step                            | Time in min (user) | Time in min (HW)
----|-----------------------------------------|--------------------|------------------
1   | Frame extraction                        | 0                  | 180 - 360
2   | Set up thresholds                       | 5 - 15             | 5 - 15
3   | Determination of bounding boxes         | 0                  | 180 - 240
4   | Classifying a subset of bounding boxes  | 5 - 15             | 5 - 15
5   | Filtering bounding boxes                | <1                 | <1
6   | Visualization                           | 0                  | <1
7   | Pattern recognition                     | 0                  | <1
8   | Pattern recognition review              | 5 - 30             | 0

Table 2: Comparison of user and hardware process times for video 1.

 

4.3 Data Reduction

Table 3 shows the reduction of data for video 1. The final relevant data amounts to 0.008% of the potentially relevant data (12 of the 143,528 initially detected bounding boxes).

 

No. | Process step                            | Table rows (input) | Table rows (output)
----|-----------------------------------------|--------------------|---------------------
1   | Frame extraction                        | 0                  | 0
2   | Set up thresholds                       | 0                  | 0
3   | Determination of bounding boxes         | 0                  | 143528
4   | Classifying a subset of bounding boxes  | 143528             | 143528
5   | Filtering bounding boxes                | 143528             | 93865
6   | Visualization                           | 93865              | 93865
7   | Pattern recognition                     | 93865              | 3859
8   | Pattern recognition review              | 3859               | 12

Table 3: Data reduction of video 1 leads to the relevant events.

 

4.4 Conclusion

Compared to the complete video time (4 hours), the user interaction takes between 25 and 70 minutes. The VAT tool enables an analyst to focus her/his attention on a limited number of automatically preselected events, whereas attentively watching several hours of video would otherwise be very difficult and exhausting.